16 research outputs found
Design and implementation of in-network coherence
Thesis (S.M.), Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2013. Title as it appears in the MIT Commencement Exercises program, June 2013: Design and implementation of in-network coherence. Cataloged from the PDF version of the thesis. Includes bibliographical references (p. 101-104).
CMOS technology scaling has enabled ever-increasing transistor density on chip. At the same time, multi-core processors that provide increased performance vis-à-vis power efficiency have become prevalent in a power-constrained environment. The shared memory model is the predominant paradigm in such systems, easing programmability and increasing portability. With memory shared by an increasing number of cores, however, a scalable coherence mechanism is imperative. Snoopy coherence has been a favored coherence scheme owing to its high performance and simplicity, yet there are few viable proposals that extend snoopy coherence to unordered interconnects - specifically, the modular packet-switched interconnects that have emerged as a scalable solution to the communication challenges of the CMP era. This thesis proposes a distributed in-network global ordering scheme that enables snoopy coherence on unordered interconnects. The proposed scheme is realized on a two-dimensional mesh interconnection network, referred to as OMNI (Ordered Mesh Network Interconnect). OMNI is an enabling solution for the SCORPIO processor prototype developed at MIT - a 36-core chip multi-processor supporting snoopy coherence, fabricated in a commercial 45nm technology. OMNI is shown to be effective, reducing runtime by 36% in comparison to directory and Hammer coherence protocol implementations. The OMNI network achieves an operating frequency of 833 MHz post-layout, occupies 10% of the chip area, and consumes less than 100 mW of power.
by Suvinay Subramanian. S.M.
FLAT: An Optimized Dataflow for Mitigating Attention Performance Bottlenecks
Attention mechanisms form the backbone of state-of-the-art machine learning
models for a variety of tasks. Deploying them on deep neural network (DNN)
accelerators, however, is prohibitively challenging especially under long
sequences, as this work identifies. This is due to operators in attention
layers exhibiting limited reuse opportunities and quadratic growth in memory
footprint, leading to severe memory-boundedness. To address this, we introduce
a new attention-tailored dataflow, termed FLAT, which identifies fusion
opportunities within the attention layer, and implements an on-chip
memory-aware interleaved execution and tiling mechanism. FLAT increases the
effective memory bandwidth by efficiently utilizing the high-bandwidth,
low-capacity on-chip buffer and thus achieves better run time and compute
resource utilization. In our evaluation, FLAT achieves 1.94x and 1.76x speedups
and 49% and 42% energy reductions, respectively, compared to baseline execution
on state-of-the-art edge and cloud accelerators.
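A toy sketch of the tiling intuition behind FLAT (not the paper's actual dataflow): processing query rows tile by tile means only a (tile × N) slice of the attention score matrix is live at any time, instead of the full quadratic N × N footprint. Names and tile granularity here are illustrative.

```python
import numpy as np

def tiled_attention(Q, K, V, tile):
    # Process query rows in tiles so the scores for one tile fit in a
    # small buffer; the full (N, N) score matrix is never materialized.
    N, d = Q.shape
    out = np.empty_like(V, dtype=float)
    for i in range(0, N, tile):
        S = Q[i:i + tile] @ K.T                 # (tile, N) score slice
        S = S - S.max(axis=1, keepdims=True)    # numerically stable softmax
        P = np.exp(S)
        P = P / P.sum(axis=1, keepdims=True)
        out[i:i + tile] = P @ V                 # (tile, d) output slice
    return out
```

The result is identical to untiled attention; only the peak memory footprint changes, which is the memory-boundedness the abstract targets.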
A scalable architecture for ordered parallelism
We present Swarm, a novel architecture that exploits ordered irregular parallelism, which is abundant but hard to mine with current software and hardware techniques. In this architecture, programs consist of short tasks with programmer-specified timestamps. Swarm executes tasks speculatively and out of order, and efficiently speculates thousands of tasks ahead of the earliest active task to uncover ordered parallelism. Swarm builds on prior TLS and HTM schemes, and contributes several new techniques that allow it to scale to large core counts and speculation windows, including a new execution model, speculation-aware hardware task management, selective aborts, and scalable ordered commits.
We evaluate Swarm on graph analytics, simulation, and database benchmarks. At 64 cores, Swarm achieves 51--122× speedups over a single-core system, and outperforms software-only parallel algorithms by 3--18×.
National Science Foundation (U.S.) (Award CAREER-145299
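A minimal software analogue of Swarm's execution model (the hardware speculates and commits in parallel; this sequential sketch only shows the ordering contract): tasks carry programmer-specified timestamps, the earliest active task commits next, and a task may spawn timestamped children. All names here are illustrative.

```python
import heapq
import itertools

def run_ordered_tasks(initial_tasks):
    # Tasks are (timestamp, fn) pairs; fn may return child (timestamp, fn)
    # pairs. A priority queue always commits the earliest timestamp next,
    # mirroring Swarm's ordered-commit semantics.
    counter = itertools.count()          # tie-breaker for equal timestamps
    pq = [(ts, next(counter), fn) for ts, fn in initial_tasks]
    heapq.heapify(pq)
    commit_order = []
    while pq:
        ts, _, fn = heapq.heappop(pq)
        commit_order.append(ts)
        for child_ts, child_fn in fn() or []:
            heapq.heappush(pq, (child_ts, next(counter), child_fn))
    return commit_order
```

Swarm's contribution is executing thousands of such tasks speculatively in parallel while still committing them in this timestamp order.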
SMART: A Single-Cycle Reconfigurable NoC for SoC Applications
As technology scales, SoCs are growing in core count, creating the need for scalable NoCs to interconnect the many cores on a chip. Given aggressive SoC design targets, NoCs have to deliver low latency and high bandwidth at low power and area overheads. In this paper, we propose the Single-cycle Multi-hop Asynchronous Repeated Traversal (SMART) NoC, a NoC that reconfigures and tailors a generic mesh topology for SoC applications at runtime. The heart of our SMART NoC is a novel low-swing clockless repeated link circuit, embedded within the router crossbars, that allows packets to potentially bypass all the way from the source to the destination core within a single clock cycle, without being latched at any intermediate router. Our clockless repeater link has been proven in silicon in 45nm SOI. Results show that at 2 GHz we can traverse 8 mm within a single cycle, i.e., 8 hops with 1 mm cores. We implement the SMART NoC to layout and show that it delivers 60% latency savings and 2.2X power savings compared to a baseline mesh NoC.
United States. Defense Advanced Research Projects Agency. The Ubiquitous High-Performance Computing Program
Training Recipe for N:M Structured Sparsity with Decaying Pruning Mask
Sparsity has become one of the promising methods to compress and accelerate
Deep Neural Networks (DNNs). Among different categories of sparsity, structured
sparsity has gained more attention due to its efficient execution on modern
accelerators. Particularly, N:M sparsity is attractive because there are
already hardware accelerator architectures that can leverage certain forms of
N:M structured sparsity to yield higher compute-efficiency. In this work, we
focus on N:M sparsity and extensively study and evaluate various training
recipes for N:M sparsity in terms of the trade-off between model accuracy and
compute cost (FLOPs). Building upon this study, we propose two new decay-based
pruning methods, namely "pruning mask decay" and "sparse structure decay". Our
evaluations indicate that these proposed methods consistently deliver
state-of-the-art (SOTA) model accuracy, comparable to unstructured sparsity, on
a Transformer-based model for a translation task. The increase in the accuracy
of the sparse model using the new training recipes comes at the cost of
marginal increase in the total training compute (FLOPs).
Comment: 11 pages, 2 figures, and 9 tables. Published at the ICML Workshop on Sparsity in Neural Networks: Advancing Understanding and Practice, 2022. First two authors contributed equally.
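For readers unfamiliar with the N:M pattern the abstract builds on: within every group of M consecutive weights, only the N largest-magnitude entries are kept (e.g. 2:4, which some accelerators execute efficiently). A small sketch of constructing such a mask, with illustrative names:

```python
import numpy as np

def nm_prune_mask(weights, n=2, m=4):
    # Build an N:M structured-sparsity mask: within every group of M
    # consecutive weights, keep the N largest magnitudes, zero the rest.
    flat = weights.reshape(-1, m)
    keep = np.argsort(np.abs(flat), axis=1)[:, m - n:]  # top-n indices per group
    mask = np.zeros_like(flat)
    np.put_along_axis(mask, keep, 1.0, axis=1)
    return mask.reshape(weights.shape)
```

The paper's "pruning mask decay" recipes concern how gradually such a mask is imposed during training, rather than the one-shot construction shown here.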
Application of the Problem Based Learning (PBL) Model to Improve Student Learning Outcomes on the Topic of Solubility and Solubility Product in Class XI IPA of SMA Negeri 1 Kampar
This research aimed to improve student learning outcomes on the topic of solubility and solubility product in class XI science of SMAN 1 Kampar. It was an experimental study with a pretest-posttest design. The sample consisted of class XI science 2 as the experimental class and class XI science 3 as the control class. The experimental class was taught with the problem based learning model, while the control class used the discussion method. The data were analyzed with a t-test, which showed t_count > t_table (1.6923 > 1.68). This means the problem based learning model can improve student learning outcomes on solubility and solubility product in class XI science of SMAN 1 Kampar, with the improvement falling in the high category.
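The reported comparison (t_count 1.6923 against a critical value of 1.68) is an independent two-sample t-test. A sketch of the pooled-variance statistic presumably used, with made-up example data since the study's scores are not given:

```python
import math

def pooled_t_statistic(x, y):
    # Independent two-sample t-test with pooled variance: compares the
    # means of two groups (e.g. experimental vs. control class scores).
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)   # sample variances
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    sp2 = ((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2)  # pooled variance
    return (mx - my) / math.sqrt(sp2 * (1 / nx + 1 / ny))
```

The null hypothesis of equal means is rejected when the statistic exceeds the one-tailed critical value from the t-distribution with nx + ny - 2 degrees of freedom, as in the 1.6923 > 1.68 comparison above.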
SCORPIO: A 36-Core Research Chip Demonstrating Snoopy Coherence on a Scalable Mesh NoC with In-Network Ordering
URL to conference program.
In the many-core era, scalable coherence and on-chip interconnects are crucial for shared memory processors. While snoopy coherence is common in small multicore systems, directory-based coherence is the de facto choice for scalability to many cores, since snoopy coherence relies on ordered interconnects, which do not scale. However, directory-based coherence itself does not scale beyond tens of cores, due to excessive directory area overhead or inaccurate sharer tracking. Prior techniques supporting ordering on arbitrary unordered networks are impractical for full multicore chip designs. We present SCORPIO, an ordered mesh Network-on-Chip (NoC) architecture with a separate fixed-latency, bufferless network to achieve distributed global ordering. Message delivery is decoupled from ordering, allowing messages to arrive in any order and at any time and still be correctly ordered. The architecture is designed to plug and play with existing multicore IP, with practicality, timing, area, and power as top concerns. Full-system 36- and 64-core simulations on SPLASH-2 and PARSEC benchmarks show average application runtime reductions of 24.1% and 12.9%, in comparison to distributed directory and AMD HyperTransport coherence protocols, respectively. The SCORPIO architecture is incorporated in an 11 mm-by-13 mm chip prototype, fabricated in IBM 45nm SOI technology, comprising 36 Freescale e200 Power Architecture(TM) cores with private L1 and L2 caches interfacing with the NoC via ARM AMBA, along with two Cadence on-chip DDR2 controllers. The chip prototype achieves a post-synthesis operating frequency of 1 GHz (833 MHz post-layout) with an estimated power of 28.8 W (768 mW per tile), while the network consumes only 10% of tile area and 19% of tile power.
United States. Defense Advanced Research Projects Agency (DARPA UHPC grant at MIT (Angstrom)); Center for Future Architectures Research; Microelectronics Advanced Research Corporation (MARCO); Semiconductor Research Corporation
TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings
In response to innovations in machine learning (ML) models, production
workloads changed radically and rapidly. TPU v4 is the fifth Google domain
specific architecture (DSA) and its third supercomputer for such ML models.
Optical circuit switches (OCSes) dynamically reconfigure its interconnect
topology to improve scale, availability, utilization, modularity, deployment,
security, power, and performance; users can pick a twisted 3D torus topology if
desired. Much cheaper, lower power, and faster than InfiniBand, OCSes and
underlying optical components are <5% of system cost and <3% of system power.
Each TPU v4 includes SparseCores, dataflow processors that accelerate models
that rely on embeddings by 5x-7x yet use only 5% of die area and power.
Deployed since 2020, TPU v4 outperforms TPU v3 by 2.1x and improves
performance/Watt by 2.7x. The TPU v4 supercomputer is 4x larger at 4096 chips
and thus ~10x faster overall, which along with OCS flexibility helps large
language models. For similar sized systems, it is ~4.3x-4.5x faster than the
Graphcore IPU Bow and is 1.2x-1.7x faster and uses 1.3x-1.9x less power than
the Nvidia A100. TPU v4s inside the energy-optimized warehouse scale computers
of Google Cloud use ~3x less energy and produce ~20x less CO2e than
contemporary DSAs in a typical on-premise data center.
Comment: 15 pages; 16 figures; to be published at ISCA 2023 (the International Symposium on Computer Architecture).
JaxPruner: A concise library for sparsity research
This paper introduces JaxPruner, an open-source JAX-based pruning and sparse
training library for machine learning research. JaxPruner aims to accelerate
research on sparse neural networks by providing concise implementations of
popular pruning and sparse training algorithms with minimal memory and latency
overhead. Algorithms implemented in JaxPruner use a common API and work
seamlessly with the popular optimization library Optax, which, in turn, enables
easy integration with existing JAX based libraries. We demonstrate this ease of
integration by providing examples in four different codebases: Scenic, t5x,
Dopamine, and FedJAX, and provide baseline experiments on popular benchmarks.
Comment: JaxPruner is hosted at http://github.com/google-research/jaxprune
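To illustrate the kind of algorithm such a library provides (this is a generic magnitude-pruning sketch in plain NumPy, not JaxPruner's actual API; the function name and parameter dict are made up):

```python
import numpy as np

def magnitude_prune(params, sparsity):
    # Generic magnitude pruning: for each parameter array, zero out the
    # smallest-magnitude fraction `sparsity` of its entries.
    pruned = {}
    for name, w in params.items():
        k = int(sparsity * w.size)       # number of entries to remove
        if k == 0:
            pruned[name] = w.copy()
            continue
        # k-th smallest absolute value serves as the pruning threshold
        thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
        pruned[name] = np.where(np.abs(w) > thresh, w, 0.0)
    return pruned
```

Libraries like the one described wrap this style of transformation behind a common API so it can be dropped into an existing JAX training loop alongside Optax.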